Voter Segmentation

Using unsupervised machine learning to segment voters

Posted by Christian Payne on July, 2022

Given the recent talk about new Conservative leadership candidates and their potential to hold the 2019 Conservative coalition together, I've been thinking about voters. Specifically, I've been thinking about types of voters. Lots of analysis focuses on voters classed by their likelihood to vote, for the purposes of campaigning. But what about identifying groups of voters using more than just their propensity to vote?

Voter Data

The data we'll be using is the latest wave of the British Election Study (BES) panel study. This data contains a number of interesting fields, including the left-right alignment of voters. Below I've provided a plot showing the left/right alignment of voters by their vote in the 2019 general election.

[Interactive plot: left/right alignment of voters by 2019 general election vote]

The dataset also contains information on voter demographics, as well as the respondent's best recollection of their previous voting behaviour. Altogether, the BES panel is an incredibly rich dataset for exploring voter behaviour.

K-means Clustering

For the cluster modelling, we'll be using the k-means clustering algorithm. In layman's terms, the algorithm works by placing k points, called "centroids", throughout the data, where 'k' is some number that we have to determine prior to fitting the algorithm. Each data point is assigned to the cluster of the centroid it is closest to. How much of the data is explained by those centroids is then calculated. Each centroid then moves to the average position of the points in its cluster, and the process starts again. This continues until the centroids stop moving, at which point the amount of the data that is explained by the clusters is optimised (at least locally). But how do we decide on how many centroids to start with?
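For the curious, the loop described above can be sketched in a few lines of plain Python. This is only a minimal illustration under my own assumptions, not the code behind the results in this post: the function names, the toy points and the choice of starting centroids (k randomly picked data points) are all made up for the example. It reports the within-cluster sum of squares (WCSS) as a rough stand-in for "how much of the data is explained", where lower means tighter clusters.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels, wcss) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # nothing changed, so the centroids are stable
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    # Within-cluster sum of squares: lower means tighter clusters.
    wcss = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return centroids, labels, wcss


# Hypothetical toy data: two loose groups of 2D points (not BES data).
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels, wcss = kmeans(toy_points, k=2)
print(labels, round(wcss, 3))
```

In practice you would reach for a library implementation, which also handles things like multiple random restarts, but the two steps inside the loop are the whole idea.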

To determine how many centroids we should use, the value of 'k', we will use an elbow plot. This plots how much of the data can be explained by an algorithm with k clusters. Obviously, more clusters would increase the amount of the data the algorithm can explain. However, the marginal gain from adding another cluster shrinks as we add more clusters. The optimal number of clusters is the value of k where there is a sudden change in the line, like an elbow.

In the plot above the change is relatively subtle, particularly as the plot is a little stretched. We'll be using 3 clusters for the algorithm we'll build.
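One rough way to make the "elbow" idea concrete is to look at how much each extra cluster improves the fit, and to pick the k where that improvement drops off most sharply. The sketch below does this for a dictionary of within-cluster sum of squares (WCSS) values keyed by k. The numbers are invented purely for illustration and are not the values from the plot above, and `marginal_gains` and `elbow_k` are hypothetical helper names. Reading the elbow by eye, as I did, is just as valid.

```python
def marginal_gains(wcss_by_k):
    """Reduction in WCSS gained by moving from the previous k to each k."""
    ks = sorted(wcss_by_k)
    return {k: wcss_by_k[prev] - wcss_by_k[k] for prev, k in zip(ks, ks[1:])}


def elbow_k(wcss_by_k):
    """Pick the k where the gain from adding a cluster falls off most sharply."""
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        gain_to_k = wcss_by_k[prev] - wcss_by_k[k]   # improvement from reaching k
        gain_after = wcss_by_k[k] - wcss_by_k[nxt]   # improvement from going past k
        bend = gain_to_k - gain_after                # a large bend is a sharp elbow
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Made-up WCSS values for k = 1..6 (illustrative only, not BES results).
toy_wcss = {1: 1000.0, 2: 600.0, 3: 350.0, 4: 300.0, 5: 270.0, 6: 250.0}
print(marginal_gains(toy_wcss))
print(elbow_k(toy_wcss))
```

With a subtle elbow, a heuristic like this can easily disagree with a visual reading, so treat it as a sanity check rather than a rule.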

Clustering Voters

With our value of k determined, we can then fit the algorithm. Visualising the clusters is a little difficult given the number of variables in our data. However, what we can do is use another algorithm called Principal Components Analysis (PCA). This reduces the number of columns by combining them into "components", ordered by how much of the variation in the data they explain. This allows us to visualise the clusters we've identified a little more easily. Below I've plotted the voter clusters against the two largest of these components, with the percentage of the data they explain in brackets.

I admit, it's a little difficult to see the difference with only two of these components. So I've put together a 3D representation of the clusters with a third component added.

[Interactive 3D plot: voter clusters plotted against the first three principal components]
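If you're wondering what PCA is doing under the hood, here is a small sketch in plain Python. It is an illustration under my own assumptions rather than the code used for the plots: it centres the columns but does not rescale them, it finds each component with simple power iteration on the covariance matrix, and the function names and toy rows are made up. The "percentage explained" shown in brackets on the plots corresponds to the explained-variance ratios returned here.

```python
import math
import random


def centre(rows):
    """Subtract each column's mean so every column averages zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centred):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(centred), len(centred[0])
    return [
        [sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenpair(matrix, iterations=500, seed=0):
    """Largest eigenvalue and a unit eigenvector, found by power iteration."""
    d = len(matrix)
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(d)]  # random, non-zero start direction
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance left to find
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * matrix[i][j] * v[j] for i in range(d) for j in range(d))
    return eigenvalue, v


def pca(rows, n_components=2):
    """Return (explained_ratios, components, scores). Assumes non-constant data."""
    centred = centre(rows)
    cov = covariance_matrix(centred)
    d = len(cov)
    total_variance = sum(cov[i][i] for i in range(d))
    ratios, components = [], []
    for _ in range(n_components):
        value, vector = top_eigenpair(cov)
        ratios.append(value / total_variance)
        components.append(vector)
        # Deflation: remove the variance just explained so the next pass
        # finds the next-largest component.
        cov = [
            [cov[i][j] - value * vector[i] * vector[j] for j in range(d)]
            for i in range(d)
        ]
    # Project each centred row onto the components to get plotting coordinates.
    scores = [
        [sum(x * w for x, w in zip(row, vec)) for vec in components]
        for row in centred
    ]
    return ratios, components, scores


# Hypothetical toy rows with two columns (not BES data).
toy_rows = [(-2.0, -2.0), (2.0, 2.0), (-1.0, 1.0), (1.0, -1.0)]
ratios, components, scores = pca(toy_rows, n_components=2)
print([round(r, 3) for r in ratios])
```

The `scores` are what end up on the axes of a cluster plot, one coordinate per component, with each point coloured by its cluster label.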

Meet the Clusters

Now that we have the clusters, I think it's time to actually look into the voters within each cluster to identify any common patterns.

Cluster 1: Liberal Cosmopolitans

Looking at the first cluster, we can see that voters in this cluster tend to be more akin to the cosmopolitan voter typology. Looking at the graph below, the first cluster has the highest proportions of Guardian readers, university-educated voters and Remain voters, as well as of those considering voting for Labour at the next election. Cluster 1 is also the most economically secure, as measured by self-reported concerns around unemployment and poverty.

Finally, cluster 1 is also the largest cluster, accounting for around 64% of the voters in our sample.
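Profiles like this boil down to two simple calculations: each cluster's share of the sample, and the proportion of each cluster's members who have a given trait. Here is a minimal sketch of both, assuming each voter is a dictionary with a cluster label and some yes/no traits. The field names (`cluster`, `university`, `voted_remain`) and the four toy records are hypothetical and are not the BES variable names or data.

```python
from collections import Counter, defaultdict


def cluster_shares(labels):
    """Fraction of the sample that falls in each cluster."""
    counts = Counter(labels)
    return {cluster: n / len(labels) for cluster, n in counts.items()}


def trait_proportions(records, cluster_key, trait_keys):
    """For each cluster, the proportion of members for whom each trait is true."""
    totals = Counter()
    hits = defaultdict(Counter)
    for record in records:
        cluster = record[cluster_key]
        totals[cluster] += 1
        for trait in trait_keys:
            if record[trait]:
                hits[cluster][trait] += 1
    return {
        cluster: {trait: hits[cluster][trait] / totals[cluster] for trait in trait_keys}
        for cluster in totals
    }


# Hypothetical toy records (field names and values invented for illustration).
toy_voters = [
    {"cluster": 1, "university": True, "voted_remain": True},
    {"cluster": 1, "university": True, "voted_remain": False},
    {"cluster": 2, "university": False, "voted_remain": False},
    {"cluster": 3, "university": False, "voted_remain": True},
]
print(cluster_shares([v["cluster"] for v in toy_voters]))
print(trait_proportions(toy_voters, "cluster", ["university", "voted_remain"]))
```

Comparing these proportions across clusters is what lets us attach a label like "Liberal Cosmopolitans" to a cluster in the first place.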

Cluster 2: Tory Loyalists

Cluster 2 is very similar to cluster 1 in many respects, as we can see from the above graph, particularly with regard to measures of economic well-being. Indeed, around 26.7% of voters in cluster 2 report a gross household income higher than £50,000, compared with 26.1% and 21.7% for clusters 1 and 3 respectively (although cluster 1 does edge cluster 2 out in the higher income brackets). However, one factor that separates cluster 2 from the others is their staunch loyalty to the Conservative party. Below I've identified some more key features that stand out for cluster 2 and plotted them.

As the name would suggest, voters in cluster 2 are ardent supporters of the Conservatives, with the party enjoying large support from them in 2015 and 2019, although their support dipped in 2017. Cluster 2 also voted heavily for 'Leave' in 2016 and is the most Christian of the clusters. One final interesting piece of information that can be gleaned from this chart actually concerns cluster 3. Specifically, it is interesting that the significant rises in cluster 3's support for the Tories coincide with the party's sizable wins in 2015 and 2019. Indeed, it is perhaps through mobilising this group that the Conservatives have paved their pathway to No. 10.

Cluster 3: The Politically Uninterested

The final cluster we have identified, cluster 3, is interesting with respect to a number of features. From an initial look, cluster 3 stands out as the most economically insecure of the three clusters, as described earlier, with only 1 in 5 voters in cluster 3 responding that they have a gross household income above £50,000, compared with over 1 in 4 for clusters 1 and 2. Cluster 3 is also the most ethnically diverse and least educated of the clusters. I've presented these insights in the graph below.

Final Thoughts